
ARC Challenge


Query Circuits: Explaining How Language Models Answer User Prompts

Wu, Tung-Yu, Barez, Fazl

arXiv.org Artificial Intelligence

Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capability circuits (e.g., indirect object identification), but not why the model answers a specific input query in a particular way. We introduce query circuits, which directly trace the information flow inside a model that maps a specific input to the output. Unlike surrogate-based approaches (e.g., sparse autoencoders), query circuits are identified within the model itself, resulting in more faithful and computationally accessible explanations. To make query circuits practical, we address two challenges. First, we introduce Normalized Deviation Faithfulness (NDF), a robust metric for evaluating how well a discovered circuit recovers the model's decision for a specific input; it is broadly applicable to circuit discovery beyond our setting. Second, we develop sampling-based methods to efficiently identify circuits that are sparse yet faithfully describe the model's behavior. Across benchmarks (IOI, arithmetic, MMLU, and ARC), we find that there exist extremely sparse query circuits within the model that can recover much of its performance on single queries. For example, a circuit covering only 1.3% of model connections can recover about 60% of performance on an MMLU question. Overall, query circuits provide a step towards faithful, scalable explanations of how language models process individual inputs.
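The abstract does not give NDF's formula, but the idea it describes, scoring how closely a circuit's output tracks the full model relative to an ablated baseline, can be sketched as follows. This is a hypothetical illustration, not the paper's exact definition; the function name and the logit-based formulation are our own assumptions.

```python
# Hypothetical sketch of a normalized-deviation-style faithfulness score.
# NOT the paper's exact NDF definition: we assume a circuit is faithful to
# the extent its answer logit deviates from the full model's by less than
# a fully ablated baseline does.

def normalized_deviation_faithfulness(full_logit: float,
                                      circuit_logit: float,
                                      ablated_logit: float) -> float:
    """Return a score in [0, 1]: 1 means the circuit matches the full
    model, 0 means it is no better than the ablated baseline."""
    denom = abs(full_logit - ablated_logit)
    if denom == 0.0:  # degenerate case: ablation changes nothing
        return 1.0
    score = 1.0 - abs(full_logit - circuit_logit) / denom
    return max(0.0, min(1.0, score))  # clip to [0, 1]
```

For example, a circuit whose answer logit is 4.8 when the full model gives 5.0 and the ablated model gives 1.0 would score 0.95 under this sketch.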


EasyARC: Evaluating Vision Language Models on True Visual Reasoning

Unsal, Mert, Akkus, Aylin

arXiv.org Artificial Intelligence

Building on recent advances in language-based reasoning models, we explore multimodal reasoning that integrates vision and text. Existing multimodal benchmarks primarily test visual extraction combined with text-based reasoning, lacking true visual reasoning with more complex interactions between vision and language. Inspired by the ARC challenge, we introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning and self-correction. EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning (RL) pipelines. The generators incorporate progressive difficulty levels, enabling structured evaluation across task types and complexities. We benchmark state-of-the-art vision-language models and analyze their failure modes. We argue that EasyARC sets a new standard for evaluating true reasoning and test-time scaling capabilities in vision-language models.
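The "procedurally generated, fully verifiable" property can be illustrated with a toy generator: a hidden rule produces input/output grid pairs, and any candidate answer can be checked exactly by re-applying the rule. The names and the specific mirror rule here are our own; the real EasyARC generators differ and cover progressive difficulty levels.

```python
import random

# Illustrative sketch of a procedurally generated, fully verifiable grid
# task in the spirit of EasyARC (the mirror rule is a stand-in example).

def make_task(size: int = 4, seed: int = 0):
    """Generate (input_grid, output_grid) where the hidden rule is a
    horizontal mirror; the answer is recomputable from the input."""
    rng = random.Random(seed)
    grid = [[rng.randint(0, 9) for _ in range(size)] for _ in range(size)]
    mirrored = [row[::-1] for row in grid]
    return grid, mirrored

def verify(grid, candidate):
    """Exact-match verification, usable as a reward signal in RL pipelines."""
    return candidate == [row[::-1] for row in grid]
```

Because the checker recomputes the answer from the rule, scoring needs no human labels, which is what makes such benchmarks scalable for RL.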


Optimizing Pretraining Data Mixtures with LLM-Estimated Utility

Held, William, Paranjape, Bhargavi, Koura, Punit Singh, Lewis, Mike, Zhang, Frank, Mihaylov, Todor

arXiv.org Artificial Intelligence

Large Language Models improve with increasing amounts of high-quality training data. However, leveraging larger datasets requires balancing quality, quantity, and diversity across sources. After evaluating nine baseline methods under both compute- and data-constrained scenarios, we find token-count heuristics outperform manual and learned mixes, indicating that simple approaches accounting for dataset size and diversity are surprisingly effective. Building on this insight, we propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by 200x. Compared to manual (Groeneveld et al., 2024, OLMo), heuristic (Chung et al., 2023, UniMax), and learned (Xie et al., 2024, DoReMi) data mixes, UtiliMax leads to more compute-efficient models that perform better on average across tasks.

Large Language Model (LLM) pretraining data increasingly consists of sub-corpora from many sources covering multiple domains and varying in size (Gao et al., 2020; Du et al., 2022; TogetherAI). Unlike traditional multi-task learning scenarios, datasets are not necessarily aligned with a specific intended use. Moreover, "intended usage" is often multi-functional as LLMs are being developed for general-purpose functionality (Eloundou et al., 2024; Qin et al., 2023). Given multiple training corpora and multiple downstream goals, how should we sample from each corpus to get the best possible model?
Prior work has explored heuristic (Rae et al., 2021; Soldaini et al., 2024) and learned (Xie et al., 2024; Albalak et al., 2023) approaches to solve this. However, there is minimal comparison between these methods using the same data and model configuration. Furthermore, it is unclear whether these approaches are robust to the impacts of epoching, which is critical as frontier models are increasingly data-constrained (Villalobos et al., 2024; Longpre et al., 2024).
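A token-count heuristic of the kind the paper finds surprisingly effective can be sketched as proportional sampling with an epoch cap, in the spirit of the UniMax-style baselines it compares against. The function and parameter names are ours, not the paper's.

```python
# Sketch of a token-count heuristic mix: allocate a training-token budget
# across corpora proportional to corpus size, capping any corpus at a
# maximum number of passes (epochs) over its data. Our own illustration.

def token_count_mix(token_counts: dict, budget: float,
                    max_epochs: float = 4.0) -> dict:
    """token_counts: corpus name -> available tokens. Returns corpus
    name -> tokens to train on, summing to at most `budget`."""
    active = dict(token_counts)       # corpora not yet at their epoch cap
    allocation = {name: 0.0 for name in token_counts}
    remaining = budget
    while remaining > 1e-9 and active:
        total = sum(active.values())
        spill = 0.0                   # budget bounced off saturated corpora
        for name, count in list(active.items()):
            share = remaining * count / total
            cap = max_epochs * token_counts[name] - allocation[name]
            take = min(share, cap)
            allocation[name] += take
            spill += share - take
            if take >= cap - 1e-12:   # corpus saturated: stop sampling it
                del active[name]
        remaining = spill             # redistribute spill next round
    return allocation
```

With a 1,000-token budget over corpora of 100 and 900 tokens, no cap binds and the mix is simply proportional; tightening `max_epochs` makes small corpora saturate and pushes budget toward larger ones, which is the epoching concern the paper raises.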


Common Sense Is All You Need

Latapie, Hugo

arXiv.org Artificial Intelligence

Artificial intelligence (AI) has made significant strides in recent years, yet it continues to struggle with a fundamental aspect of cognition present in all animals: common sense. Current AI systems, including those designed for complex tasks like autonomous driving, problem-solving challenges such as the Abstraction and Reasoning Corpus (ARC), and conversational benchmarks like the Turing Test, often lack the ability to adapt to new situations without extensive prior knowledge. This manuscript argues that integrating common sense into AI systems is essential for achieving true autonomy and unlocking the full societal and commercial value of AI. We propose a shift in the order of knowledge acquisition, emphasizing the importance of developing AI systems that start from minimal prior knowledge and are capable of contextual learning, adaptive reasoning, and embodiment -- even within abstract domains. Additionally, we highlight the need to rethink the AI software stack to address this foundational challenge. Without common sense, AI systems may never reach true autonomy, instead exhibiting asymptotic performance that approaches theoretical ideals like AIXI but remains unattainable in practice due to infinite resource and computation requirements. While scaling AI models and passing benchmarks like the Turing Test have brought significant advancements in applications that do not require autonomy, these approaches alone are insufficient to achieve autonomous AI with common sense. By redefining existing benchmarks and challenges to enforce constraints that require genuine common sense, and by broadening our understanding of embodiment to include both physical and abstract domains, we can encourage the development of AI systems better equipped to handle the complexities of real-world and abstract environments.


In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Borchmann, Łukasz

arXiv.org Artificial Intelligence

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
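The two evaluation setups the paper contrasts can be sketched concretely: one scores each answer choice in isolation, so the model never sees the alternatives it must beat; the other puts all options in the prompt so the model can compare them directly. The function names are ours, and `score` stands in for a model's log-likelihood of a completion given a prompt.

```python
# Sketch of the two multiple-choice evaluation setups (our own naming).
# `score(prompt, completion)` stands in for a model's log-likelihood.

def pick_separately(score, question, choices):
    """'Cloze'-style setup: each choice is scored in isolation, without
    the model ever seeing the competing alternatives."""
    return max(choices, key=lambda c: score(question, c))

def pick_jointly(score, question, choices):
    """All-options-in-prompt setup: the model sees every choice and
    answers with a letter, enabling direct comparison."""
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(letters, choices))
    best = max(letters[:len(choices)], key=lambda l: score(prompt, l))
    return choices[letters.index(best)]
```

The same model can give different answers under the two setups, which is exactly how an evaluation scheme, rather than inherent task complexity, can make a benchmark appear harder.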


OpenAI's o3 model aced a test of AI reasoning – but it's still not AGI

New Scientist

OpenAI's new o3 artificial intelligence model has achieved a breakthrough high score on a prestigious AI reasoning test called the ARC Challenge, inspiring some AI fans to speculate that o3 has achieved artificial general intelligence (AGI). But even as ARC Challenge organisers described o3's achievement as a major milestone, they also cautioned that it has not won the competition's grand prize – and it is only one step on the path towards AGI, a term for hypothetical future AI with human-like intelligence. The o3 model is the latest in a line of AI releases that follow on from the large language models powering ChatGPT. "This is a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models," said François Chollet, an engineer at Google and the main creator of the ARC Challenge, in a blog post. Chollet designed the Abstraction and Reasoning Corpus (ARC) Challenge in 2019 to test how well AIs can find correct patterns linking pairs of coloured grids. Such visual puzzles are intended to make AIs demonstrate a form of general intelligence with basic reasoning capabilities.


A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering

Scalena, Daniel, Fersini, Elisabetta, Nissim, Malvina

arXiv.org Artificial Intelligence

Adapting models to a language that was only partially present in the pre-training data requires fine-tuning, which is expensive in terms of both data and computational resources. As an alternative to fine-tuning, we explore the potential of activation steering-based techniques to enhance model performance on Italian tasks. Through our experiments we show that Italian steering (i) can be successfully applied to different models, (ii) achieves performances comparable to, or even better than, fine-tuned models for Italian, and (iii) yields higher quality and consistency in Italian generations. We also discuss the utility of steering and fine-tuning in the contemporary LLM landscape where models are anyway getting high Italian performances even if not explicitly trained in this language.
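The core mechanics of contrastive activation steering can be sketched in a few lines: extract hidden activations for target-language (here, Italian) prompts and contrastive (e.g., English) prompts at the same layer, take the mean difference as a steering direction, and add it back, scaled, during generation. This is a toy illustration with our own names, not the paper's implementation, and the activations here are plain lists standing in for model hidden states.

```python
# Minimal sketch of contrastive activation steering (our own toy setup).
# Activations are lists of floats standing in for one layer's hidden states.

def steering_vector(target_acts, contrast_acts):
    """target_acts, contrast_acts: lists of hidden-state vectors from the
    same layer. Returns the mean-difference steering direction."""
    dim = len(target_acts[0])
    def mean(acts, i):
        return sum(a[i] for a in acts) / len(acts)
    return [mean(target_acts, i) - mean(contrast_acts, i) for i in range(dim)]

def steer(hidden, direction, alpha=1.0):
    """Add the scaled steering direction to a hidden state at inference.
    alpha controls the strength of the 'gentle push'."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

Because only a single vector per layer is stored and no weights are updated, this kind of intervention is far cheaper than fine-tuning, which is the trade-off the paper explores.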


EuroLLM: Multilingual Language Models for Europe

Martins, Pedro Henrique, Fernandes, Patrick, Alves, João, Guerreiro, Nuno M., Rei, Ricardo, Alves, Duarte M., Pombal, José, Farajian, Amin, Faysse, Manuel, Klimaszewski, Mateusz, Colombo, Pierre, Haddow, Barry, de Souza, José G. C., Birch, Alexandra, Martins, André F. T.

arXiv.org Artificial Intelligence

The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.


An Approach to Solving the Abstraction and Reasoning Corpus (ARC) Challenge

Min, Tan John Chong

arXiv.org Artificial Intelligence

We utilise the power of Large Language Models (LLMs), in particular GPT-4, prompt-engineered to perform an arbitrary task. Here, we give the model some human priors via text, along with some typical procedures for solving ARC tasks, and ask it to generate i) a broad description of the input-output relation, ii) detailed steps of the input-output mapping, and iii) an application of those detailed steps to the test input to derive the test output. The current GPT-3.5/GPT-4 prompt solves 2 out of 4 tested small ARC challenges (those with small grids of 8x8 and below). With tweaks to make the prompt more specific to the use case, it can solve more. We posit that, when scaled to a multi-agent system with usage of past memory and equipped with an image interpretation tool via Visual Question Answering, we may actually be able to solve the majority of the ARC challenge.
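The three-step prompt structure described in the abstract can be sketched as a template builder. The wording and function name below are ours, not the authors' actual prompt; grids are shown as nested lists for brevity.

```python
# Illustrative prompt skeleton following the three-step structure the
# abstract describes (wording is ours, not the authors' exact prompt).

def build_arc_prompt(train_pairs, test_input):
    """train_pairs: list of (input_grid, output_grid) demonstrations.
    Returns a prompt asking for the three generation steps in order."""
    examples = "\n".join(
        f"Input: {inp}\nOutput: {out}" for inp, out in train_pairs)
    return (
        "You are solving an ARC puzzle.\n"
        f"{examples}\n\n"
        "1) Give a broad description of the input-output relation.\n"
        "2) Give detailed steps of the input-output mapping.\n"
        "3) Apply those steps to the test input and derive the test output.\n"
        f"Test input: {test_input}"
    )
```

Forcing the model to state the relation and the mapping steps before producing the answer is a chain-of-thought-style decomposition; the multi-agent extension the authors posit would layer memory and visual tools on top of the same skeleton.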